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ABSTRACT 

In this paper we analyze and compare the performance on a hypercube mul- 
tiprocessor of some of the major multigrid techniques used in practice. The model 
problem considered here is that of solving the 2-D incompressible Navier-Stokes 
equations representing the flow between two parallel plates. Results obtained by 
implementing the different multigrid schemes on an iPSC are presented. Effects on 
the overall performance of various parameters of the algorithms, of the partitioning 
strategies employed, and of some of the characteristics of the underlying architecture 
are discussed. 
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1. Introduction 

Multigrid algorithms are found to be optimal and efficient for solving a large 
class of problems involving partial differential equations on sequential machines. 
Recently there has been increased interest in parallelizing these algorithms. Although 
the multigrid methods exhibit a high degree of parallelism in the individual operas 
tions involved, it is not necessarily true that these algorithms perform optimally on 
multiprocessor systems as well. The performance of a multiprocessor system solving 
a given problem depends on several parameters. These include architecture depen- 
dent parameters, algorithm dependent parameters, and implementation dependent 
parameters. In this paper we discuss some of the performance issues involved in the 
parallel implementation of these algorithms and present some experimental results 
obtained by solving the 2-D Navier-Stokes equations for incompressible fluid flow on 
the Intel’s Personal Supercomputer (iPSC). All the experimental results presented 
here are obtained with the Release 3.0 iPSC operating system. In the following sec- 
tion we briefly describe the idea behind the multigrid methods and present three 
different algorithms based on this idea. In Section 3 the model problem is given. In 
Section 4 the performance issues involved are discussed and some experimental 
results are presented. Conclusions are given in Section 5. 

2. Multigrid Algorithms 

Consider a differential equation given by LU = F with boundary conditions 
BU = g defined on an n -dimensional domain in R n . For simplicity of exposition 
let L be an elliptic operator. Let the difference scheme L* U k = F h with boundary 
condition B k U k = g k , approximate this differential equation. 
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Now suppose we are solving the difference scheme by relaxation (Gauss-Seidel in 
lexicographic order, for instance). The error here can be written as e B * — U k - «„* 
where, «„* is the current approximation after the n-th relaxation sweep. Now con- 
sider the ratio ft„ — ||e»* + i || / ||e*|| where, || • || denotes the L^norm. From numer- 
ical experiments it is seen that the above ratio increases with n from some value 
/i 0 < 1 and approaches a number that may be very close to one. That is, conver- 
gence is fast in the first few steps and then slows down. 

A closer study reveals that whenever the error e„* is not smooth, n n is small, 
giving a good convergence rate. When e k is smooth the resulting convergence rate 
becomes poor. That is, relaxation smoothes the error. The main idea of multigrid is 
this: if the error is smooth , approximate it by a coarse grid, say of mesh size 2 h . 
Applying this idea recursively one arrives at a multigrid algorithm. It involves relax- 
ation sweeps on all levels, transfer of residuals from a fine to coarse level, and inter- 
polation of correction from coarse to fine level. An important property of such an 
algorithm is that the rate of convergence remains independent of the size of the 
problem if the order and the frequency with which the grids are visited are chosen 
properly. 

The recursive idea described above for reducing the error in the solution of the 
problem gives rise to a cyclic order of computation and these are referred to as the 
multigrid cycles. Different multigrid algorithms have been developed depending on 
the order and the frequency with which the grids are visited within a cycle. The 
most commonly used ones are the V and W cycles. In addition to these two types 
there is another type of cycle called F cycle which is less well known. A scheme called 
Full MultiGrid (FMG), in which the first approximation on the fine level is obtained 
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Fig. 1 (a) V Cycle (b) F Cycle 

by solving a similar problem on a coarser level, yields optimal performance. A 
detailed description of the various multigrid techniques and the algorithms based on 
these cycles is given in [Brandt 1984]. For the sake of clarity, we will refer to the 
algorithms based on these three cycles as the V, F, and W algorithms. All three algo- 
rithms make use of the FMG scheme. Furthermore, the basic multigrid operations in 
these algorithms remain the same, but the number of times a given grid is visited 
within a multigrid cycle is different. We will see that the relative performance of 
these three algorithms on multiprocessor systems is not the same as on the sequential 
machines. 

From the performance point of view the parameters that characterize these 
algorithms are the amount of computational work done per cycle, the number of 
cycles required for achieving the desired accuracy, and the order and the relative fre- 
quency with which different grids are visited. The last parameter is not so important 
for the sequential implementations but is a crucial factor for parallel applications. 
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Fig. 2 W Cycle 

If w L denotes the total amount of computational work on the highest level L 
then it can be shown that the total work per FMG cycle for the V algorithm is 

asymptotically less than or equal to —•«>£, . It is less than or equal to and 

Q 27 

4 w l for the F and W algorithms, respectively. These calculations assume that the 
number of relaxation sweeps on each grid is a small constant. If the number of mul- 
tigrid cycles necessary to achieve the desired accuracy is also a small constant then 
all three algorithms perform optimally, i.e., in time 0(w L ), on sequential machines. 

Characterizing the convergence properties of these algorithms is difficult, but 
experimental results suggest that in general the convergence rates per cycle for the 
W algorithm are the best and those of the V algorithm are mediocre. The F algo- 
rithm is somewhere between the two, usually slightly worse than the W algorithm. 
So the W algorithm is almost always preferred over the other two on sequential 



machines. 
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Figures 1 and 2 illustrate the order in which the three algorithms visit the 
different levels within a cycle. Here the increasing level numbers indicate finer mesh 
sizes. The letters within circles denote the number of relaxation sweeps on the 
corresponding levels. On levels with empty circles r l + r 2 number of relaxations are 
performed. It can be easily verified that for the V algorithm each level is visited 
exactly once within a multigrid cycle. For the F algorithm, level i is visited / - * + 1 
times where / is the highest level for that cycle. For the W algorithm, level »' is 
visited 2 <_ ’ times. Thus, the number of visits to the coarsest level grows exponen- 
tially for the W algorithm, whereas for the F algorithm it grows linearly. The 
significance of this property will become clearer in Section 4 where we discuss perfor- 
mance issues. 

3. Model Problem 

The model problem considered here is that of solving the 2-D steady state 
incompressible Navier-Stokes equations. Such equations arise, for example, in study- 
ing the fully developed flow between two parallel plates where one plate may be 
moving with respect to the other plate. Their solutions present some of the 
difficulties involved in solving real life problems, but at the same time are simple 
enough for experimentation on currently available multiprocessor systems. 

The equations in terms of vorticity w and stream function tp are: 

Aip = w 

« u, + r w. = -jj— Aw. 

Ke 

Re is the Reynold’s number of the fluid flow and « and v are the velocity com- 
ponents in the X and Y directions, respectively. The velocity components are given 



6 


in terms of the stream function ^ by, 

« = - tl> 9 and ti = t/>, . 

If the computational domain is Q = {(ar , y ) | 0 < x <1, 0< y < l} and if 
U 0 is the velocity of the moving plate, then the boundary conditions for such a flow 
are given by, 

u = v — 0 at y = 0 

u = U 0 , v = 0 at y = 1 
Periodicity is imposed in the X direction and 
a constant pressure gradient is imposed on the flow. 

The details of discretizing and solving these equations using the multigrid methods 
on the hypercube multiprocessor system are given elsewhere 1 . 

4. Performance Issues in the Parallel Implementation 

There are several parameters that affect the performance of a multiprocessor 
system employed to solve a given problem. These parameters are algorithm and 
implementation dependent as well as architecture dependent. One must take into 
account all these parameters before making a decision about the suitability of a par- 
ticular algorithm for solving a problem on a multiprocessor system. In the following, 
we describe the interaction of some of these parameters. 

4-1 Partitioning Scheme 

First we describe the effect of partitioning the domain on the distribution of the 
computational load. For the sake of simplicity consider a 2-D square domain with 

*V. K. Naik and S. Ta'asan, A methodology for implementing multigrid methods in solving JVavier-Stoies equa- 
tions on a hypercube multiprocessor system, in preparation. 
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2 l x 2 l points on the highest level L . We divide the domain on the finest level into 
2* partitions along the X direction and 2* partitions along the Y direction. Thus we 
get 2* + * partitions with each partition having 2 i_ * points along the X direction and 
2 l ~ * points along the Y direction. We map the partitions onto the hypercube nodes 
in a one-to-one fashion using the binary reflected grey code scheme [Chan 1986]. We 
further assume that a fixed region partitioning strategy is used, i.e., on the succes- 
sive coarser levels each partition contains regions of the domain, formed by the 
points that are the coarse level counterparts of the fine level points of the partition. 
Under this partitioning scheme, in moving from level L to level max(ar , y ) the 
number of points associated with each partition decreases by a factor of four. Below 
level max(i , y ) each partition has at most one line of points. With further coarsening 
the number of points per partition is halved until level min(ar, y ) is reached. On that 
level each partition has at most one point of the domain. Furthermore, in moving 
from level / to level l -l, where max(x , y) > l > min(x, y), the number of parti- 
tions having any points and hence any computational work reduces by a factor of 
two. When / is less than or equal to min(z , y) this number reduces by a factor of 
four. 

With the above described properties of the partitioning scheme, it is possible to 
make some predictions about the performance of the various multigrid algorithms 
under some assumptions about the communication costs. Specifically, it is possible to 
develop some analytical bounds on the speedups or efficiency of the system with a 
given set of communication parameters. Here we consider some simple cases and 
present some experimental results. Detailed analytical results are presented else- 
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where 1 . 

Consider the case where there are as many partitions as there are points on the 
finest level i.e., each processor is assigned one point of the domain. In the above 
notation this means that x = y = L . Assume that there are no communication 
costs and so the evaluated performance parameters will represent the upper bounds. 
We consider speedup or efficiency of the system as a measure of the performance. 
We define the speedup of a system with N processors as the ratio of the time taken 
by a single processor to solve a problem to the time taken by N processors to solve 
the same problem using the same algorithm. The efficiency of this system is obtained 
by dividing the speedup by N . 

The total computational cost incurred with N processors is bounded by the 
computational cost of the processor that performs work on all the levels. Now for the 
case considered here the work on any level is equal to the work associated with a sin- 
gle point and so, the total computational cost C N of the system is given by, 

Cn = Yj w < 

f—i 

where, V { is the total number of visits to level i and is the maximum work per 
processor on level « - a constant in this case. Thus, 

j eW 

speedup - — 

E ^ 

■-1 


1 V. K. Naik and S. Ta’asan, Performance studies of multigrid algorithms implemented on menage passing ar- 
chitectures, in preparation. 
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where, c 


16 
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for V Algorithm 

for F Algorithm 
for W Algorithm 


Note that here N — 4 1 . Depending on the type of algorithm chosen, the sum over 
total number of visits varies. It can be shown that for the V algorithm this sum is 
0(L 2 ). It is 0(L 3 ) and 0(2 L ) for the F and W algorithms, respectively. Thus, the 

maximum speedup with N processors for the V algorithm is Ol — 1, for the F 

{ log 4 N J 

algorithm it is and for the w algorithm it is 0(iV 2 ). This shows that 

when there are as many processors as there are number of points on the fine level, 
the speedup for the W algorithm is far from being optimal even when the communi- 
cation costs are ignored. 


For the cases where the number of points assigned to each partition on the 
highest grid is more than one, the expressions for the speedups are more complex. In 
general the bounds improve. Here we present some experimental results. The effects 
of the algorithm dependent properties and of the partition size on the computational 
efficiency are shown in Fig. 3. The results shown in this figure are obtained by 
measuring only the computational costs on the iPSC. The efficiencies for each algo- 
rithm are normalized with respect to the computational cost of solving the problem 
using that algorithm on a uniprocessor. Thus, although the efficiencies of the three 
algorithms shown in Fig. 3 cannot be compared directly against one another, these 
curves provide information about the relative degradation in performance as more 
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processors are added or as the partition size is decreased and these trends can be 
compared. Clearly, the sensitivity to the partition size is different for the three algo- 
rithms; V algorithm is the least sensitive and W algorithm is the most sensitive. 
Another point to be noted is that for the smaller size hypercubes i.e., when the parti- 
tion size is large, all three algorithms show improvements in efficiency and the 
difference in the efficiencies of the three algorithms decreases. The effect of the size of 
the partitions is shown explicitly in Fig. 4 for the F algorithm. The other two algo- 
rithms show similar trends. As in Fig. 3 the measurements are made without includ- 
ing the communication costs. The results presented in these two figures suggest that 
even if communication is instantaneous, the attainable speedup or efficiency is low if 
the amount of work per partition is small. If the communication costs are included 
in the measurements then the performance deteriorates as shown by Fig. 5 for the F 
algorithm. 

The efficiency discussed above represents a measure of the ability of an algo- 
rithm to keep the processors of the system busy. It does not include the effect of the 
numerical properties of the algorithm. If one is interested in the minimum overall 
cost of solving the problem, then both of these properties must be taken into 
account. The numerical properties are usually dependent on the problem being con- 
sidered and so cannot be characterized easily. For the V algorithm these properties 
sometimes depend on the mesh size also. For the problem we are considering here 
both the F and W algorithms need about two FMG cycles to solve a 128 x 128 prob- 
lem to the level of discretization error, whereas the V algorithm takes about seven 
cycles for the same. To compare the three algorithms more accurately we compute 
the efficiencies using the best sequential timings which for this problem are given by 
the F algorithm. We refer to such an efficiency as the normalized efficiency. In all 
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cases the problem is solved to get the same level of numerical accuracy. These 
results are shown in Fig. 6. Note that here the communication costs are included in 
computing the normalized efficiency. It can be seen that when the partition sizes are 
large and the hypercube size is small, both the F and W algorithms perform better 
than the V algorithm in spite of the adverse communication costs. For small parti- 
tion sizes the V algorithm may perform better even though its convergence proper- 
ties are inferior. 

4-8 Partition Shape 

When the number of points per partition on the finest level is more than one, 
the shape of the partitions is another parameter that has to be taken into account. 
Reed et al. [Reed 1986] have discussed in detail the combined effect of the iteration 
stencil, the partition shape, and the communication parameters of the underlying 
architecture on the total communication cost. Their discussion concentrates on 
minimizing the communication cost assuming that the computational work is evenly 
distributed and remains the same through out the computation. For multigrid algo- 
rithms, the fact that the computational work decreases on the coarser levels must 
also be taken into account. We explain this point with the help of an example. 
Consider a domain with 64 by 64 points on the fine level. Assume that 64 partitions 
are to be made on the fine level. In the fixed region partitioning scheme, if we use 
square partitions, then each partition has at least one point on levels 3, 4, 5, and 6. 
(Level 1 is the coarsest level.) On the other hand if one were to partition the domain 
in strips (one column of 64 points in each partition, for example), then only on level 
6 would all partitions have some points assigned to them. Note that in both cases 
each partition has the same computational work on the finest level. Thus among 
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squares, rectangles, and strips, squares balance the computational load best. 

Experimental results showing the efficiency of partitions with different shapes in 
balancing the computational load are shown in Fig. 7. Note that the communication 
costs are not taken into account. Here a domain with 64x64 points on the fine level is 
subdivided into 64 partitions. The four different cases considered consisted of parti- 
tions with 8x8 points (8 points in x direction and 8 in y direction), 16x4 points, 32x2 
points, and 64x1 points on the fine level. In the first case we get square partitions 
whereas in the last case we get strips. As expected the squares balance the computa- 
tional load better than any other shapes considered. The results shown here 
correspond to the F algorithm. For the W algorithm these will be more pronounced, 
but less so for the V algorithm. 

When communication costs are introduced the shape of the partitions may 
affect the performance differently. For the iPSC the cost of initializing a message is 
orders of magnitude higher than sending a single byte across a channel. A single 
packet can contain up to 1024 bytes. In addition, all the channels leaving or entering 
a node cannot be effectively utilized for simultaneously sending or receiving messages. 
For the problem sizes we have considered here, it turns out that the communication 
costs on the iPSC for the strips are less than those for the squares. This is shown in 
Fig. 8 for the F algorithm applied to a problem with 64x64 points on the highest 
level. Here the efficiency is based on the computation plus communication cost. It is 
obvious from Figures 7 and 8, that for the problem sizes we are considering, the com- 
munication costs incurred with strips are much less than those for the squares. In 
general this is not true. The problem sizes considered here are special cases and it 
can be easily shown that for strips, asymptotically, the total number of packets for 
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the two messages sent out per exchange of information with the neighboring parti- 
tions is greater by a constant factor than those for the four messages sent out by the 
square partitions when the number of interior points per partition is the same [Reed 
1986]. 

4-8 Schemes for Reducing the Communication Costs 

In the partitioning schemes considered above, the regions of the domain are per- 
manently assigned to the processors on all the levels even when the associated com- 
putational work is small. Sometimes it is advantageous to resort to a shifting region 
partitioning scheme. In this scheme below a certain level /* the work on the entire 
domain is shifted to one node so that on all the successive coarser levels there is no 
communication cost. On levels / * and above the computational work is uniformly 
distributed among all the partitions, but below level /* the computation is serialized. 
Thus every time there is transition between levels /* and / * - 1 either the data has 
to be gathered to one partition or scattered to all partitions from one partition. This 
scheme performs well if l * is such that 

Yj f Cpurt, + ^part, ) > Yj C Dom , + C ( • + S, . . 
f =1 V ' (=1 

where, (7 J)art( and C p ' aTtl denote the computation and communication costs, respec- 
tively, associated with a partion on level /. Cj) omi is the computation cost associated 
with the entire domain on level / . G t . and S ( . are the costs of gathering and scatter- 
ing the domain on level /*, respectively. 

Experimental results showing the performance improvements brought about by 
serializing the work below some level /* by moving all the regions to a single node, 
are shown in Fig. 9. Here the percent increase in efficiency by serializing the work 
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below levels 2 through 5 on a 16 PE hypercube is shown for the two problems with 
the finest level domain having 64x64 and 128x128 points. In this figure /* = 1 
corresponds to the fixed region partitioning scheme i.e., no moving takes place. It can 
be seen that the performance peeks out at a particular value of / * . The savings in 
communication costs achieved by serializing the work above this level is offset by the 
increase in the computation cost. Note that when the problem size is small or when 
the size of the partitions assigned to each processor is small, the gains are higher. 
Here each partition has a smaller piece of work on the highest level and so the com- 
munication costs are more dominant. By serializing the computation below a certain 
level, the percentage reduction in the total cost is higher than that in the bigger size 
problem. In Fig. 10 we show the effect of the above described partitioning scheme 
when the computing power is increased by adding more processors. Note that for 
the larger size hypercube, the cost of scattering and gathering the data is also higher. 
But now the computational work associated with each partition has decreased and so 
the communication costs form a higher proportion of the total cost. 

5. Conclusions 

We have considered the performance issues involved in implementing the mul- 
tigrid methods on a hypercube multiprocessor system. It is shown that both algo- 
rithm dependent as well as implementation dependent parameters affect the perfor- 
mance considerably and the selection of an algorithm or of a partioning scheme must 
be based on the combined effect of these parameters. We demonstrate by some 
experimental and analytical results that the best sequential algorithm may not 
always be the most suitable algorithm for parallel processing. At the same time an 
algorithm that gives the best speedups may not be the most suitable candidate 
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either. By using the problem of solving the 2-D Navier-Stokes equations as a model 
problem we show that a less well known method given by the F algorithm gives the 
best performance. We have also shown that when the communication costs are high 
instead of balancing the computational load it may be advantageous to sequentialize 
some parts of the work and avoid communication costs in those sections. 
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Fig. 3: Effect of Algorithm Characteristics 
on Computational Efficiency 
Square Partitions, 128x128 Domain 



Fig. 4: Effect of Partition Size 
on Computational Efficiency 
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Fig. 7: Effect of Partition Shape 
on Computational Efficiency 
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Fig. 8: Effect of Partition Shape 
on Efficiency (Communication Costs Inc.) 
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Fig. 9: Percent Change in Efficiency by 
Serializing Computation Below Level 1* 
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Fig. 10: Effect of Cube Size on 
Serializing Computation Below Level 1* 

Square Partitions, 128x128 Domain, F Cycle 
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